GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R
Stefan Jünger & Dennis Abel
2025-04-10
| Day      | Time        | Title                          |
|----------|-------------|--------------------------------|
| April 09 | 10:00-11:30 | Introduction                   |
| April 09 | 11:30-11:45 | Coffee Break                   |
| April 09 | 11:45-13:00 | Data Formats                   |
| April 09 | 13:00-14:00 | Lunch Break                    |
| April 09 | 14:00-15:30 | Mapping I                      |
| April 09 | 15:30-15:45 | Coffee Break                   |
| April 09 | 15:45-17:00 | Spatial Wrangling              |
| April 10 | 09:00-10:30 | Mapping II                     |
| April 10 | 10:30-10:45 | Coffee Break                   |
| April 10 | 10:45-12:00 | Applied Spatial Linking        |
| April 10 | 12:00-13:00 | Lunch Break                    |
| April 10 | 13:00-14:30 | Spatial Autocorrelation        |
| April 10 | 14:30-14:45 | Coffee Break                   |
| April 10 | 14:45-16:00 | Spatial Econometrics & Outlook |
Thus far
We’ve done some wrangling, mapping, and linking of geospatial data (with georeferenced survey data)
We’ve seen that geospatial data are relevant for providing context (as social scientists, we know that space is important), and they are nice to look at: we can tell a story!
However, geospatial data can be interesting on their own for social science studies!
Tobler’s first law of geography
[E]verything is related to everything else, but near things are more related than distant things (Tobler 1970, p. 236)
This means that nearby geographical regions, institutions, or people are more similar to one another or influence each other more strongly.
What we get is an interdependent system.
Spatial Interdependence or Autocorrelation
Tobler’s law is the fundamental principle of spatial analysis. We want to know
whether observations in our data are spatially interdependent
and how this interdependence can be explained (= the data-generating process)
Developing a model of connectedness: the chess board
Rook and queen neighborhoods
It’s an interdependent system
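The chess-board analogy can be made concrete with spdep’s `cell2nb()`, which builds neighbour lists for a regular grid. A minimal sketch on an 8×8 board (rook = shared edges only, queen = shared edges or corners):

```r
library(spdep)

# Neighbour lists on an 8x8 grid (a chess board)
rook_nb  <- cell2nb(8, 8, type = "rook")   # shared edges only
queen_nb <- cell2nb(8, 8, type = "queen")  # shared edges or corners

# Interior cells have 4 rook neighbours but 8 queen neighbours;
# corner cells have 2 and 3, respectively
table(card(rook_nb))   # card() counts links per cell
table(card(queen_nb))
```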
Let’s do it hands-on: Our ‘research’ question
Say we are interested in AfD voting outcomes in relation to the ethnic composition of neighborhoods.
A combination of far-right voting research with Allport’s classic contact theory
We are just doing it in the urban context of Cologne (again)
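Contiguity-based neighbour lists like the one summarized below are built with spdep’s `poly2nb()`. A self-contained sketch on toy polygons (in the workshop, the same call is applied to the sf object holding the Cologne voting districts):

```r
library(sf)
library(spdep)

# Three unit squares in a row; squares 1-2 and 2-3 share an edge
squares <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0)))),
  st_polygon(list(rbind(c(2, 0), c(3, 0), c(3, 1), c(2, 1), c(2, 0))))
)

# queen = TRUE also counts shared corner points as contiguity
queens_nb <- poly2nb(squares, queen = TRUE)
summary(queens_nb)
```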
Neighbour list object:
Number of regions: 543
Number of nonzero links: 3120
Percentage nonzero weights: 1.058169
Average number of links: 5.745856
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 9 48 89 137 97 70 43 19 17 4 5 2 1 1
1 least connected region:
69 with 1 link
1 most connected region:
387 with 15 links
Unfortunately, we are not yet done with creating the links between neighborhoods. What we receive is, in principle, a huge matrix of connected observations.
As it stands, that is nothing we could plug into a statistical model, such as a regression (see the next session).
Normalization
Normalization is the process of creating actual spatial weights. There is a huge dispute on how to do it (Neumayer & Plümper, 2016). But nobody questions whether it should be done in the first place since, among other reasons, it restricts the parameter space of the weights.
One of the disputed but, at the same time, standard procedures is row-normalization. It divides each individual weight (= connection between spatial units) \(w_{ij}\) by the sum of all weights in its row:
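Formally, row-standardization can be written as (standard formulation, with \(w_{ij}\) the raw, e.g. binary contiguity, weights):

```latex
w_{ij}^{*} = \frac{w_{ij}}{\sum_{j=1}^{n} w_{ij}}
```

so that each row of the resulting weights matrix sums to one.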
Characteristics of weights list object:
Neighbour list object:
Number of regions: 543
Number of nonzero links: 3120
Percentage nonzero weights: 1.058169
Average number of links: 5.745856
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 9 48 89 137 97 70 43 19 17 4 5 2 1 1
1 least connected region:
69 with 1 link
1 most connected region:
387 with 15 links
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 543 294849 543 201.1676 2261.458
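A weights list object with style "W" (row-standardization, as in the summary above) is created with spdep’s `nb2listw()`. A sketch on a toy grid (in the workshop, the queen neighbour list of the Cologne districts is passed instead):

```r
library(spdep)

# Toy neighbour list; the workshop uses the Cologne queen contiguity list
nb <- cell2nb(3, 3, type = "queen")

# style = "W": divide each weight by its row sum (row-standardization)
listw_W <- nb2listw(nb, style = "W")

# Every row of weights now sums to 1
sapply(listw_W$weights, sum)
```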
First and foremost, Moran’s I uses the previously created weights between all pairs of spatial units \(w_{ij}\). It weights deviations from the overall mean value for connected pairs according to the strength of the modeled spatial relations. Moran’s I can be interpreted as a correlation coefficient (-1 = perfect negative spatial autocorrelation; +1 = perfect positive spatial autocorrelation).
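In its common form (standard definition, with \(S_0 = \sum_i \sum_j w_{ij}\)):

```latex
I = \frac{n}{S_0} \,
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```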
Moran I test under randomisation
data: election_results$immigrant_share
weights: queens_W
Moran I statistic standard deviate = 20.897, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.5398961097 -0.0018450185 0.0006720411
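Output like the above comes from spdep’s `moran.test()`; the data and weights names in the output imply the call `moran.test(election_results$immigrant_share, queens_W)`. A self-contained sketch on toy data:

```r
library(spdep)
set.seed(123)

# Toy grid and row-standardized weights; in the workshop:
# moran.test(election_results$immigrant_share, queens_W)
nb <- cell2nb(10, 10, type = "queen")
listw <- nb2listw(nb, style = "W")
x <- rnorm(100)

moran.test(x, listw)
```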
Test of spatial autocorrelation: Geary’s C
Moran’s I is a global statistic for spatial autocorrelation. It can produce issues when there are only local clusters of spatial interdependence in the data. An alternative is the use of Geary's C:
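In its common form (standard definition, with \(S_0 = \sum_i \sum_j w_{ij}\)):

```latex
C = \frac{n - 1}{2 S_0} \,
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - x_j)^2}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```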
As you can see, in the numerator, the average value \(\bar{x}\) is not as prominent as in Moran’s I. Geary’s C only produces values between 0 and 2 (value near 0 = positive spatial autocorrelation; 1 = no spatial autocorrelation; values near 2 = negative spatial autocorrelation).
Geary C test under randomisation
data: election_results$immigrant_share
weights: queens_W
Geary C statistic standard deviate = 16.951, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic Expectation Variance
0.4649079513 1.0000000000 0.0009965167
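The corresponding call is spdep’s `geary.test()`; the output names imply `geary.test(election_results$immigrant_share, queens_W)`. A toy sketch:

```r
library(spdep)
set.seed(123)

# In the workshop: geary.test(election_results$immigrant_share, queens_W)
nb <- cell2nb(10, 10, type = "queen")
listw <- nb2listw(nb, style = "W")

geary.test(rnorm(100), listw)
```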
Modern interface: the sfdep package
The sfdep package provides a more tidyverse-compliant syntax for spatial weights. See:
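A sketch of the sfdep workflow using `st_contiguity()`, `st_weights()`, `global_moran_test()`, and `global_c_test()`, illustrated on a toy grid (the workshop objects are noted in the comments):

```r
library(sf)
library(sfdep)
set.seed(123)

# Toy 4x4 polygon grid standing in for the Cologne districts
area <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0)
))))
grid <- st_sf(x = rnorm(16), geometry = st_make_grid(area, n = c(4, 4)))

nb <- st_contiguity(st_geometry(grid))  # queen contiguity, sfdep-style
wt <- st_weights(nb, style = "W")       # row-standardized weights

# In the workshop: global_moran_test(election_results$immigrant_share, nb, wt)
global_moran_test(grid$x, nb, wt)
global_c_test(grid$x, nb, wt)
```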
Moran I test under randomisation
data: x
weights: listw
Moran I statistic standard deviate = 20.897, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.5398961097 -0.0018450185 0.0006720411
Geary C test under randomisation
data: x
weights: listw
Geary C statistic standard deviate = 16.951, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic Expectation Variance
0.4649079513 1.0000000000 0.0009965167
Measures of local spatial autocorrelation: LISA clusters
We show you the sfdep package because it provides nice functions to calculate local measures of spatial autocorrelation. One popular choice is the estimation of Local Indicators of Spatial Autocorrelation (i.e., LISA clusters). Most straightforwardly, they can be interpreted as case-specific indicators of spatial autocorrelation:
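A sketch using sfdep’s `local_moran()` on a toy grid (in the workshop it is applied to `election_results$immigrant_share` with the contiguity neighbours and weights created before):

```r
library(sf)
library(sfdep)
set.seed(123)

area <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0)
))))
grid <- st_sf(x = rnorm(16), geometry = st_make_grid(area, n = c(4, 4)))

nb <- st_contiguity(st_geometry(grid))
wt <- st_weights(nb, style = "W")

# In the workshop: local_moran(election_results$immigrant_share, nb, wt)
lisa <- local_moran(grid$x, nb, wt)

# One row per spatial unit: local I (ii), p-values, and cluster labels
# (High-High, Low-Low, ...) in the mean/median/pysal columns
head(lisa)
```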
You now know how to model the connectedness of spatial units, investigate spatial autocorrelation globally and locally, and map it.
There’s way more, particularly regarding spatial weights (see exercise), clustering techniques (e.g., Hot Spot Analysis), or autocorrelation with more than one or two variables.
Nevertheless, now we know our data are spatially autocorrelated. Let’s try to find out why this is the case with some spatial econometrics.